[-empyre-] Opening post from Paul Koerbin, National Library of Australia - Preserving our online heritage

"Paul Koerbin" <pkoerbin@nla.gov.au> · Thu, 3 Feb 2005 09:31:17 +1100

This message is my opening post to this forum. I will just raise a few
of the issues we at the National Library of Australia confront in regard
to archiving web resources, in the hope of starting some discussion. The
National Library of Australia has been actively archiving Australian
online resources since late 1996. The results of this activity, and much
information about it, are found in the PANDORA Archive
http://pandora.nla.gov.au

The Library decided to take a selective approach to archiving web
resources since this approach was best able to produce results that met
the Library's charter to collect, describe and make available
Australia's documentary heritage. It is worth emphasising the
second and third aspects of that mission, since the other approach that
could have been adopted would be to try to collect everything, in which
case describing the resources and making them available becomes much more
problematic. By being selective, we work on a much smaller scale than
whole domain harvesting and therefore we are able to describe
individually the resources we archive in full MARC catalogue records;
and, we are able to negotiate with publishers for permission to archive
and for permission to make the archived resource available through the
PANDORA portal - remember that in Australia legal deposit does not cover
electronic resources. The consequence of the approach that the Library
has taken is a functional, accessible web archive; the Library has also
developed a management system and routine procedures to build and
maintain this collection.

That said, it is obviously relevant to this forum to highlight just a
few of the significant problems faced by institutions such as the
National Library in undertaking web archiving. 

Firstly, the selection decision as to what to include in the archive, no
matter how well intentioned and how diligently pursued, will always
remain contentious. In the Library we are as aware of this as anyone.
While in the context of web archiving we have focused selection on
content, indeed content defined as unique and substantial, in other
areas of the Library's collection activities we aim to be comprehensive
and indeed for many years have built a vast collection of print ephemera
(i.e. content that is not substantial but considered of considerable
long-term research value). Nevertheless, if one adopts a selective
approach, one also has to make the selections. Clearly some areas are going
to be better represented in the Archive than others, and therefore for
researchers in the future this can give a biased view of the nature of
the web in its early years. That said, researchers have always had to
deal with and interpret the primary resources that survive and are
available to them whether print, artefact or whatever.

Another problem with the selective approach, of concern to some
researchers, is that the focus on content means that the context is
lost. And it is true to say that the selective approach is a documentary
sort of approach, each resource considered in its own right and in
relative isolation. So a fundamental and defining aspect of a web
resource, its hypertextuality and connection with other resources, is to
a large degree lost, at least in functional terms (it may be
documented in the code).
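
As a rough illustration of how that context remains "documented in the
code", the outbound links of an archived page can still be listed from
its HTML even when the targets themselves were not archived. This is only
a sketch using the Python standard library; the sample page, the URLs and
the names LinkCollector and outbound_links are hypothetical and are not
part of the PANDORA system.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outbound_links(html, page_url):
    """Return absolute URLs of links that point outside the page's own host."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the page's original address.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc != own_host:
            links.append(absolute)
    return links


# Hypothetical archived page, for illustration only.
sample = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://example.org/essay">Essay</a></p>'
)
print(outbound_links(sample, "http://example.com/index.html"))
```

A list like this records where a resource pointed at the time of capture,
which is documentation of the context rather than a working link.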

Another point I will raise in this opening message relates to the pros
and cons of the technical aspect of the web archiving process we
undertake at the Library. The PANDORA Archive depends primarily on
harvesting resources and this process is itself dependent upon the
technology to undertake this harvesting. We have used three or four
different harvesting tools (currently the offline browsing software HTTrack)
to do this work. This technology basically acts as a browser and what we
actually capture for the Archive is what is delivered via a browser; so,
for example, dynamic content is rendered in the Archive as static
content. This is not fundamentally a problem for the Library since we
deal with "published" material and the Internet browser is essential to
the online publishing process. After all, that is what the reader or
user sees. This may however be an issue for the producer, especially the
producers of more imaginative, artistic and creative sites, who may see
the original files as essential to the integrity of their work.
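
To make the point about dynamic content concrete, here is a toy sketch in
Python. It does not use HTTrack or any real harvester; dynamic_page is a
hypothetical stand-in for a server-side script, and harvest simply keeps
whatever HTML one request delivers, which is all a browser-like harvester
ever sees.

```python
from datetime import datetime


def dynamic_page(now):
    """Stand-in for a server-side script: the HTML depends on when it is requested."""
    return "<html><body><p>Generated at {}</p></body></html>".format(now.isoformat())


def harvest(render, when):
    """Capture only the delivered HTML, as a browser-like harvester would."""
    return render(when)


# The archive stores the output of one request, not the script behind it.
snapshot = harvest(dynamic_page, datetime(2005, 2, 3, 9, 31))

# A later visit to the live site gets fresh output; the archived copy does not change.
live_later = dynamic_page(datetime(2005, 6, 1, 12, 0))

print(snapshot)
print(snapshot == live_later)
```

The archived snapshot is a fixed document, while the original files that
generated it never reach the Archive.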

Harvesting of web resources in this manner is also limited. There are
many problems encountered with harvesting in respect of obtaining a
complete and functional version of the resource. On the positive side,
working on the scale that we do, we have been able to deal with many of
these problems by manual intervention to "fix" the archived resource
to resemble the content, look and functionality of the original. However,
this work, being manual and labour intensive, is not sustainable for
large-scale archiving. But, while we can deal with many of the problems
encountered with harvesting there are also problems that are currently
intractable. Harvesting complex Flash sites that have embedded links is
currently not practical to do, yet this technology is widely used,
particularly by the more creative publishers on the web. Perhaps the
creators of these sites need to consider the archival requirements and
create archival versions of their works - these may be slightly
different from the versions published on the web. (Don't ask me how to
do this! I'm just throwing this point in to raise discussion. I wish I
did have the answers.)

Regards

Paul Koerbin
Supervisor
Digital Archiving Section
National Library of Australia

(02) 6262 1411
pkoerbin@nla.gov.au